Credit Card Approval Prediction

Group 4: Zhaohui (Emily) Fan, Amber Moon, Hansol Kim

DATASET

Kaggle Datasource: https://www.kaggle.com/rikdifos/credit-card-approval-prediction

Other Context: https://mp.weixin.qq.com/s/upjzuPg5AMIDsGxlpqnoCg

Business Context

Credit score cards are a common risk control method in the financial industry. The data contains personal information submitted by credit card applicants, which we can use to predict whether the bank should issue a credit card to an applicant. Two data files are used: the application records and the credit records, the latter holding monthly credit card account status information.

Challenges Addressed in Analysis

Clearly explain the business problem you are trying to solve using machine learning and data mining. Remember, you need to approach this as if you were presenting to your boss, a CEO, or a Board of Directors. (Don't be afraid to show your technical and communication skills. You need to select relevant business questions to answer.)

In [1]:
# import libraries and data
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd   
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
import itertools

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
In [2]:
applications_df = pd.read_csv('application_record.csv')
credit_df = pd.read_csv('credit_record.csv')
In [3]:
# plt.rcParams['figure.facecolor'] = 'white'

Exploratory Analysis & Preprocessing

Primary business question: which applicant features indicate a potentially risky candidate for extending credit?

For the purposes of this assignment, we will assume that a unique ID equates to a unique person to be included in modeling. Therefore, data shaping is based on the unique IDs that appear in each dataset.

In [4]:
# For Applications dataset: 
# View shape, nulls, cardinality, duplication (unique count of ID)
applications_df.shape
Out[4]:
(438557, 18)
In [5]:
applications_df.isnull().sum()/applications_df.shape[0]
Out[5]:
ID                     0.00000
CODE_GENDER            0.00000
FLAG_OWN_CAR           0.00000
FLAG_OWN_REALTY        0.00000
CNT_CHILDREN           0.00000
AMT_INCOME_TOTAL       0.00000
NAME_INCOME_TYPE       0.00000
NAME_EDUCATION_TYPE    0.00000
NAME_FAMILY_STATUS     0.00000
NAME_HOUSING_TYPE      0.00000
DAYS_BIRTH             0.00000
DAYS_EMPLOYED          0.00000
FLAG_MOBIL             0.00000
FLAG_WORK_PHONE        0.00000
FLAG_PHONE             0.00000
FLAG_EMAIL             0.00000
OCCUPATION_TYPE        0.30601
CNT_FAM_MEMBERS        0.00000
dtype: float64
In [6]:
# For Credit Records dataset: 
# View shape, nulls, cardinality, duplication (unique count of ID)
credit_df.shape
Out[6]:
(1048575, 3)
In [7]:
credit_df.isnull().sum()
Out[7]:
ID                0
MONTHS_BALANCE    0
STATUS            0
dtype: int64
In [8]:
print("Unique IDs in credit records dataset:")
print(credit_df['ID'].nunique())
print('')
print("Unique IDs in applications dataset:")
print(applications_df['ID'].nunique())
Unique IDs in credit records dataset:
45985

Unique IDs in applications dataset:
438510

Addressing Null Values

In [9]:
# 30% of Occupation Type column is null
applications_df['OCCUPATION_TYPE'].value_counts().sort_values().plot(kind='barh', figsize=(7,5))
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e61fcb2888>

Because Occupation Type is a categorical variable, some possible options for addressing nulls are:

1) replace nulls with the mode

2) ignore (drop) the null observations

3) create a new category within the variable

4) predict the missing values

However, since our dataset includes a comparable variable called Income Type, we will drop Occupation Type altogether.
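For reference, option 3 would have been a one-liner had we kept the column. A minimal sketch on toy data (the values below are illustrative, not from the dataset):

```python
import pandas as pd

# Fold nulls into an explicit "Unknown" category instead of dropping the column
demo = pd.DataFrame({"OCCUPATION_TYPE": ["Laborers", None, "Core staff"]})
demo["OCCUPATION_TYPE"] = demo["OCCUPATION_TYPE"].fillna("Unknown")
print(demo["OCCUPATION_TYPE"].tolist())  # → ['Laborers', 'Unknown', 'Core staff']
```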

In [10]:
applications_df.drop('OCCUPATION_TYPE', axis=1, inplace=True)

We also see above that the Credit Records dataset has a long structure with 45,985 unique values in 1,048,575 rows, while the Applications dataset has minimal duplication.

Therefore, to merge the two into one de-duplicated dataset with our target prediction variable we will:

1) Identify how many IDs are common to both datasets (36,457)

2) Reshape Credit Records by Grouping by ID

3) Identify IDs where at any point in time there was a balance 60+ days overdue

In [11]:
# keep only applications whose ID also appears in the credit records
id_index = list(set(applications_df['ID']).intersection(set(credit_df['ID'])))
applications_df = applications_df[applications_df['ID'].isin(id_index)]
print(applications_df.shape)
(36457, 17)
In [12]:
# flag any month where the balance was 60+ days overdue (STATUS codes 2-5)
credit_df['dep_value'] = None
credit_df.loc[credit_df['STATUS'].isin(['2', '3', '4', '5']), 'dep_value'] = 'Yes'
In [13]:
# a non-null count > 0 means the ID had at least one 60+ day overdue month
resp = credit_df.groupby('ID').count()
resp.loc[resp['dep_value'] > 0, 'dep_value'] = 'Yes'
resp.loc[resp['dep_value'] == 0, 'dep_value'] = 'No'
resp = resp[['dep_value']]
In [14]:
df = pd.merge(applications_df, resp, how='inner', on='ID')
df['target'] = df['dep_value']
df.loc[df['target'] == 'Yes', 'target'] = 1
df.loc[df['target'] == 'No', 'target'] = 0
df.head().T
Out[14]:
0 1 2 3 4
ID 5008804 5008805 5008806 5008808 5008809
CODE_GENDER M M M F F
FLAG_OWN_CAR Y Y Y N N
FLAG_OWN_REALTY Y Y Y Y Y
CNT_CHILDREN 0 0 0 0 0
AMT_INCOME_TOTAL 427500 427500 112500 270000 270000
NAME_INCOME_TYPE Working Working Working Commercial associate Commercial associate
NAME_EDUCATION_TYPE Higher education Higher education Secondary / secondary special Secondary / secondary special Secondary / secondary special
NAME_FAMILY_STATUS Civil marriage Civil marriage Married Single / not married Single / not married
NAME_HOUSING_TYPE Rented apartment Rented apartment House / apartment House / apartment House / apartment
DAYS_BIRTH -12005 -12005 -21474 -19110 -19110
DAYS_EMPLOYED -4542 -4542 -1134 -3051 -3051
FLAG_MOBIL 1 1 1 1 1
FLAG_WORK_PHONE 1 1 0 0 0
FLAG_PHONE 0 0 0 1 1
FLAG_EMAIL 0 0 0 1 1
CNT_FAM_MEMBERS 2 2 2 1 1
dep_value No No No No No
target 0 0 0 0 0
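The inner merge above is what discards applications that have no credit history. A toy sketch with made-up IDs (not the project's data) shows the semantics:

```python
import pandas as pd

# Only IDs present in BOTH frames survive an inner merge
apps = pd.DataFrame({"ID": [1, 2, 3], "income": [100, 200, 300]})
flags = pd.DataFrame({"dep_value": ["Yes", "No"]},
                     index=pd.Index([2, 3], name="ID"))
merged = pd.merge(apps, flags, how="inner", on="ID")
print(merged["ID"].tolist())  # → [2, 3]  (ID 1 had no credit record)
```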
In [15]:
# rename columns ('Reality' is this project's shorthand for realty/property ownership)
df.rename(columns={'CODE_GENDER': 'Gender',
                   'FLAG_OWN_CAR': 'Car',
                   'FLAG_OWN_REALTY': 'Reality',
                   'CNT_CHILDREN': 'Children',
                   'AMT_INCOME_TOTAL': 'income',
                   'NAME_EDUCATION_TYPE': 'education',
                   'NAME_HOUSING_TYPE': 'housing',
                   'FLAG_EMAIL': 'email',
                   'NAME_INCOME_TYPE': 'income_type',
                   'CNT_FAM_MEMBERS': 'family_size'},
          inplace=True)

Visualization & Descriptive Analysis

In [16]:
# How do income and education relate to credit status?
plt.figure(figsize=(5,5))
sns.barplot(x="income", y="education", hue="dep_value", data=df)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e625b30608>
In [17]:
plt.figure(figsize=(5,5))
sns.stripplot(x="dep_value", y="income", data=df, jitter=True)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e6262a12c8>
In [18]:
# catplot is figure-level and creates its own figure, so size it via height/aspect
# (a bare plt.figure() call before it would only produce an empty extra figure)
sns.catplot(x="dep_value", y="income", hue="Gender", kind="box", data=df, height=5)
Out[18]:
<seaborn.axisgrid.FacetGrid at 0x1e6262e9e88>
In [19]:
# add column for age in years
df['age'] = df['DAYS_BIRTH']/-365
# add column for years in workforce
df['years_working'] = df['DAYS_EMPLOYED']/-365
# drop columns we aren't going to use
df = df.drop(columns={'ID', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'email', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'NAME_FAMILY_STATUS'})
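A quick sanity check of the day-to-year conversion, using the first applicant from the head() output earlier (DAYS_BIRTH = -12005). Note that any positive DAYS_EMPLOYED values (sometimes used in similar datasets as a placeholder for pensioners) would come out as negative years_working and may warrant separate handling:

```python
# Negative day counts divided by -365 give age in years
days_birth = -12005          # first applicant in the head() output above
print(round(days_birth / -365, 1))  # → 32.9
```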
In [20]:
# create subset risky_df to look for relationships within only "high risk"
# "high risk" = at any point in time had a balance 60+ days overdue
# target holds integers, so filter on the int 1 (the string '1' would match nothing)
risky_df = df.loc[df['target'] == 1]

# Within risky subset, where can relationships be identified?
# pairplot manages its own figure; per-subplot size is set via height
sns.pairplot(risky_df, height=2)
Out[20]:
<seaborn.axisgrid.PairGrid at 0x1e62643be88>
In [21]:
# Within risky subset, what are the relationships between income, age, gender and credit status?
# lmplot is figure-level and creates its own figure; size it via height/aspect
sns.lmplot(x="income", y="age", hue="Gender", data=risky_df, height=7)
Out[21]:
<seaborn.axisgrid.FacetGrid at 0x1e6285cae88>
In [22]:
# Income by gender of applicants in dataset
sns.catplot(data=df.sort_values("income"), orient="h", kind="box", x="income", y="Gender", height=3, aspect=2)
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x1e6285dd548>
In [23]:
# Family Size by gender of applicants in dataset
sns.catplot(data=df.sort_values("family_size"), orient="h", kind="box", x="family_size", y="Gender", height=3, aspect=2)
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x1e629358948>
In [24]:
# will add a couple more charts that show other features
# prior to submitting project if time
# histograms
# heatmap, etc.

Feature Engineering

Binary Data

In [25]:
# encode gender: F -> 1, M -> 0
df['Gender'] = df['Gender'].replace(['F','M'],[1,0])
print(df['Gender'].value_counts())
# replace car flag with 0/1
df['Car'] = df['Car'].replace(['N','Y'],[0,1])
print(df['Car'].value_counts())
# replace realty with 0/1
df['Reality'] = df['Reality'].replace(['N','Y'],[0,1])
print(df['Reality'].value_counts())
1    24430
0    12027
Name: Gender, dtype: int64
0    22614
1    13843
Name: Car, dtype: int64
1    24506
0    11951
Name: Reality, dtype: int64

Categorical Data

In [26]:
# Convert continuous income to category
df['income'] = pd.cut(df['income'], bins=3, labels=["low", "medium", "high"])
# Convert continuous Children to category
df.loc[df['Children'] >= 2,'Children']='2orMore'
# Convert continuous Family Size to category
df.loc[df['family_size'] >= 3,'family_size']='3orMore'
# Convert continuous Age to category
df['age'] = pd.cut(df['age'], bins=3, labels=["young", "middle_age", "older"])
# Convert continuous Years Working to category
df['years_working'] = pd.cut(df['years_working'], bins=3, labels=["entry", "mid_career", "seasoned"])
df.head(2)
Out[26]:
Gender Car Reality Children income income_type education housing family_size dep_value target age years_working
0 0 1 1 0 low Working Higher education Rented apartment 2 No 0 young seasoned
1 0 1 1 0 low Working Higher education Rented apartment 2 No 0 young seasoned
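One caveat worth knowing: pd.cut splits the *range* into equal-width bins, not equal-sized groups, so with a right-skewed income distribution most applicants land in "low" (pd.qcut would give equal-frequency bins instead). A toy sketch with illustrative values:

```python
import pandas as pd

# Range 1..9 split into 3 equal-width bins: edges ~(1, 3.67], (3.67, 6.33], (6.33, 9]
s = pd.Series([1, 2, 3, 9])
print(pd.cut(s, bins=3, labels=["low", "medium", "high"]).tolist())
# → ['low', 'low', 'low', 'high'] — no value falls in the "medium" bin
```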

One Hot Encoding

In [27]:
# convert all categorical columns to binary indicator columns
df = pd.get_dummies(df, columns=['income_type',
                                 'education',
                                 'housing',
                                 'income',
                                 'Children',
                                 'family_size',
                                 'age',
                                 'years_working'])
df.head(2)
Out[27]:
Gender Car Reality dep_value target income_type_Commercial associate income_type_Pensioner income_type_State servant income_type_Student income_type_Working ... Children_2orMore family_size_1.0 family_size_2.0 family_size_3orMore age_young age_middle_age age_older years_working_entry years_working_mid_career years_working_seasoned
0 0 1 1 No 0 0 0 0 0 1 ... 0 0 1 0 1 0 0 0 0 1
1 0 1 1 No 0 0 0 0 0 1 ... 0 0 1 0 1 0 0 0 0 1

2 rows × 36 columns
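get_dummies names each new flag "column_value", which is why the long auto-generated names get shortened in the next cell. A toy sketch (values illustrative):

```python
import pandas as pd

# Each category becomes its own 0/1 column named "col_value"
demo = pd.DataFrame({"housing": ["rented", "house", "rented"]})
print(pd.get_dummies(demo, columns=["housing"]).columns.tolist())
# → ['housing_house', 'housing_rented']
```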

In [28]:
# drop the helper column, then shorten the auto-generated dummy names
df = df.drop(columns=['dep_value'])
df.columns = ['Gender','Car','Reality','target','income_type_Commercial','income_type_Pensioner','income_type_State',
              'income_type_Student','income_type_Working','education_degree','education_Higher_ed',
              'education_Incomplete_higher','education_Lower_secondary','education_Secondary','housing_apartment',
              'housing_house','housing_municipal','housing_office_apartment','housing_rented_apartment','housing_parents',
              'income_low','income_medium','income_high','Children_0','Children_1','Children_2orMore','family_size_1',
              'family_size_2','family_size_3orMore','age_young','age_middle','age_older','years_working_entry',
              'years_working_mid_career','years_working_seasoned']
In [29]:
df['income_type_Commercial'] = df.income_type_Commercial.astype('int64')
df['income_type_Pensioner'] = df.income_type_Pensioner.astype('int64')
df['income_type_State'] = df.income_type_State.astype('int64')
df['income_type_Student'] = df.income_type_Student.astype('int64')
df['income_type_Working'] = df.income_type_Working.astype('int64')
df['education_degree'] = df.education_degree.astype('int64')
df['education_Higher_ed'] = df.education_Higher_ed.astype('int64')
df['education_Incomplete_higher'] = df.education_Incomplete_higher.astype('int64')
df['education_Lower_secondary'] = df.education_Lower_secondary.astype('int64')
df['education_Secondary'] = df.education_Secondary.astype('int64')
df['housing_apartment'] = df.housing_apartment.astype('int64')
df['housing_house'] = df.housing_house.astype('int64')
df['housing_municipal'] = df.housing_municipal.astype('int64')
df['housing_office_apartment'] = df.housing_office_apartment.astype('int64')
df['housing_rented_apartment'] = df.housing_rented_apartment.astype('int64')
df['housing_parents'] = df.housing_parents.astype('int64')
df['income_low'] = df.income_low.astype('int64')
df['income_medium'] = df.income_medium.astype('int64')
df['income_high'] = df.income_high.astype('int64')
df['Children_0'] = df.Children_0.astype('int64')
df['Children_1'] = df.Children_1.astype('int64')
df['Children_2orMore'] = df.Children_2orMore.astype('int64')
df['family_size_1'] = df.family_size_1.astype('int64')
df['family_size_2'] = df.family_size_2.astype('int64')
df['family_size_3orMore'] = df.family_size_3orMore.astype('int64')
df['age_young'] = df.age_young.astype('int64')
df['age_middle'] = df.age_middle.astype('int64')
df['age_older'] = df.age_older.astype('int64')
df['years_working_entry'] = df.years_working_entry.astype('int64')
df['years_working_mid_career'] = df.years_working_mid_career.astype('int64')
df['years_working_seasoned'] = df.years_working_seasoned.astype('int64')

Split data and fix imbalance using SMOTE

Instructor Comments

  1. Why didn't you fix the imbalance problem upfront before applying Decision Forest?
  2. Random Forest was full of bias. You can improve this by making the trees have more depth.
  3. Why didn't you choose SMOTE as a pre-processing step?
  4. When using SMOTE, only apply the synthetic samples to the training data; not testing data. Balance the training data first with SMOTE and build the model. Measure the performance of the model on test data.
In [30]:
# columns of data we will use to make classifications
X = df.drop('target', axis=1).copy()

# what we want to predict
y = df['target'].copy()
In [31]:
# Checking if the data is imbalanced or not
sum(y)/ len(y)
# 1.69% of the applicants are target users. We need to make sure we maintain the same % across training and testing datasets.
# It's called "stratification": split the data to maintain the ratio. 
Out[31]:
0.016896617933455853
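A quick back-of-envelope check of what that ratio means in absolute terms, using the two numbers shown above:

```python
# ratio of flagged applicants times the merged row count gives the raw count
print(round(0.016896617933455853 * 36457))  # → 616
```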
In [32]:
# Using the Synthetic Minority Over-sampling Technique (SMOTE) to overcome the class imbalance problem.
# Note: fit_resample is the current imblearn API (fit_sample was the old, deprecated alias).
X_balance, Y_balance = SMOTE().fit_resample(X, y)
X_balance = pd.DataFrame(X_balance, columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X_balance,Y_balance, 
                                                    stratify=Y_balance, test_size=0.3,
                                                    random_state = 1024)

Logistic Regression

In [33]:
logistic_model = LogisticRegression(C=0.8,
                           random_state=0,
                           solver='lbfgs')
logistic_model.fit(X_train, y_train)
y_predict = logistic_model.predict(X_test)

print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))
Accuracy Score is 0.67096
      0     1
0  8498  2254
1  4822  5931

Decision Tree

In [34]:
# metrics for the logistic regression predictions above
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
Accuracy Score is 0.67096
Precision Score is 0.72462
Recall Score is 0.55157
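The three scores can be recomputed by hand from the logistic regression confusion matrix printed above (TN=8498, FP=2254, FN=4822, TP=5931), which is a useful sanity check on the formulas in the comments:

```python
# Recompute accuracy, precision, and recall from the raw confusion-matrix cells
tn, fp, fn, tp = 8498, 2254, 4822, 5931

print(round((tp + tn) / (tp + tn + fp + fn), 5))  # → 0.67096 (accuracy)
print(round(tp / (tp + fp), 5))                   # → 0.72462 (precision)
print(round(tp / (tp + fn), 5))                   # → 0.55157 (recall)
```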
In [35]:
%%time
clf_dt = DecisionTreeClassifier(random_state=1024)
clf_dt = clf_dt.fit(X_train, y_train)
y_predict = clf_dt.predict(X_test)

plot_confusion_matrix(clf_dt, X_test, y_test, display_labels=["No Balance > 60 Days", "Had Balance > 60 Days"])
Wall time: 462 ms
Out[35]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1e629c23708>
In [36]:
path = clf_dt.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
ccp_alphas = ccp_alphas[:-1]

clf_dts = []
for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=1024, ccp_alpha=ccp_alpha)
    clf_dt.fit(X_train, y_train)
    clf_dts.append(clf_dt)
In [37]:
# plot accuracy
train_scores = [clf_dt.score(X_train, y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test, y_test) for clf_dt in clf_dts]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()
In [38]:
%%time
# use cross validation to find best alpha
alpha_loop_values = []
for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
    alpha_loop_values.append([ccp_alpha, np.mean(scores), np.std(scores)])
    
alpha_results = pd.DataFrame(alpha_loop_values, 
                             columns=['alpha', 'mean_accuracy', 'std'])

alpha_results.plot(x='alpha', 
                   y='mean_accuracy', 
                   yerr='std', 
                   marker='o', 
                   linestyle='--')
Wall time: 7min 22s
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e62ed22708>
In [39]:
# Find exact value
alpha_results[(alpha_results['alpha'] > 0.000)
             &
             (alpha_results['alpha'] < 0.001)]
Out[39]:
alpha mean_accuracy std
1 9.013772e-09 0.804691 0.002645
2 9.581466e-09 0.804691 0.002645
3 1.830789e-08 0.804691 0.002645
4 2.787336e-08 0.804691 0.002645
5 3.874768e-08 0.804691 0.002645
... ... ... ...
499 8.309203e-04 0.702932 0.007369
500 8.326528e-04 0.702393 0.007769
501 8.857058e-04 0.697311 0.008946
502 9.361669e-04 0.695398 0.008323
503 9.645097e-04 0.692349 0.010850

503 rows × 3 columns

In [40]:
# Store the chosen alpha; per the table above, larger alphas trade some
# accuracy for a much simpler, more interpretable tree
ideal_ccp_alpha = 0.001
In [41]:
%%time
clf_dt_pruned = DecisionTreeClassifier(random_state=1024, 
                                       ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train) 
y_predict_pruned = clf_dt_pruned.predict(X_test)

print(pd.DataFrame(confusion_matrix(y_test,y_predict_pruned)))
      0     1
0  6144  4608
1  2150  8603
Wall time: 283 ms
In [42]:
# confusion matrix
plot_confusion_matrix(clf_dt_pruned, 
                      X_test, 
                      y_test, 
                      display_labels=["No Balance > 60 Days", "Had Balance > 60 Days"])
Out[42]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1e62edba988>
In [43]:
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict_pruned)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict_pruned)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict_pruned)))
Accuracy Score is 0.68575
Precision Score is 0.6512
Recall Score is 0.80006
In [44]:
# Draw Decision Tree
plt.figure(figsize=(25,25))
plot_tree(clf_dt_pruned,
          filled=True,
          rounded=True,
          class_names=["No Overdue", "Yes Overdue"],
          feature_names=X.columns)
Out[44]:
[Decision tree plot rendered above. The pruned tree's root splits on Reality <= 0.5 (gini = 0.5, 50,177 balanced samples), with deeper splits on Gender, Car, Children, family_size, housing, income_type, education, and age features.]
 Text(1308.7113402061857, 922.1785714285714, 'age_young <= 0.5\ngini = 0.392\nsamples = 1698\nvalue = [1244, 454]\nclass = No Overdue'),
 Text(1279.9484536082475, 825.1071428571429, 'gini = 0.467\nsamples = 1090\nvalue = [685, 405]\nclass = No Overdue'),
 Text(1337.4742268041236, 825.1071428571429, 'gini = 0.148\nsamples = 608\nvalue = [559, 49]\nclass = No Overdue'),
 Text(1366.2371134020618, 922.1785714285714, 'gini = 0.069\nsamples = 977\nvalue = [942, 35]\nclass = No Overdue')]
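Each node in the tree above reports a Gini impurity next to its class counts. As a sanity check, the impurity can be recomputed directly from those counts; this is a small standalone sketch, not part of the original notebook cells.

```python
# Gini impurity of a node: 1 - sum(p_k^2) over the class proportions p_k.
def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([2655, 2664]), 3))  # near-even split -> impurity close to 0.5
print(round(gini([716, 228]), 3))    # matches the 0.366 shown in the tree
print(gini([184, 0]))                # pure node -> 0.0
```

A node split is chosen to reduce the weighted average impurity of the children relative to the parent, which is why pure leaves (gini = 0.0) terminate a branch.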

Random Forest

In [45]:
%%time
# 250 trees; max_depth and min_samples_leaf cap tree growth to limit overfitting
clf_rf = RandomForestClassifier(n_estimators=250,
                                max_depth=50,
                                min_samples_leaf=16)
clf_rf.fit(X_train, y_train)
y_predict = clf_rf.predict(X_test)

print(pd.DataFrame(confusion_matrix(y_test, y_predict)))
      0     1
0  8228  2524
1  2109  8644
Wall time: 11 s
In [46]:
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
Accuracy Score is 0.78456
Precision Score is 0.774
Recall Score is 0.80387
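The precision and recall formulas quoted in the comments can be verified by hand from the printed confusion matrix. This standalone check uses the counts from the random forest matrix above (sklearn's convention: rows are actual labels, columns are predicted labels, with the positive class in the second row/column).

```python
# Counts copied from the random forest confusion matrix above
tn, fp = 8228, 2524   # actual class 0: correctly / incorrectly classified
fn, tp = 2109, 8644   # actual class 1: missed / correctly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were caught

print(f'Accuracy  {accuracy:.5f}')   # 0.78456
print(f'Precision {precision:.5f}')  # 0.77400
print(f'Recall    {recall:.5f}')     # 0.80387
```

The hand computation reproduces the scores reported by sklearn, confirming the class-1 ("overdue") orientation of the metrics.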
In [47]:
clf_rf_matrix = pd.crosstab(y_test, y_predict, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(clf_rf_matrix, annot=True, fmt='d')  # fmt='d' shows raw counts instead of scientific notation
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e62f414d08>

Gradient Boosting: LightGBM

In [48]:
%%time
# LGBMClassifier is already imported at the top of the notebook
clf_gbm = LGBMClassifier(num_leaves=35,
                         max_depth=8,
                         learning_rate=0.02,
                         n_estimators=250,
                         subsample=0.8,         # row subsampling per boosting round
                         colsample_bytree=0.8)  # feature subsampling per tree
clf_gbm.fit(X_train, y_train)
y_predict = clf_gbm.predict(X_test)

print(pd.DataFrame(confusion_matrix(y_test, y_predict)))
      0     1
0  8249  2503
1  2193  8560
Wall time: 1.96 s
In [49]:
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
Accuracy Score is 0.78163
Precision Score is 0.77375
Recall Score is 0.79606
In [50]:
# Confusion Matrix on the test data
plot_confusion_matrix(clf_gbm,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=["No Balance > 60 Days", "Had Balance > 60 Days"])
Out[50]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1e62f50dd48>
In [51]:
# Show the most important features according to the fitted model

def plot_importance(classifier, x_train, point_size=25):
    # Pair each column with its importance score, sorted descending
    values = sorted(zip(x_train.columns, classifier.feature_importances_),
                    key=lambda x: -x[1])
    imp = pd.DataFrame(values, columns=["Name", "Score"])
    imp.sort_values(by='Score', inplace=True)
    b = sns.scatterplot(x='Score', y='Name', linewidth=0,
                        data=imp, s=point_size, color='red')
    fig = plt.gcf()
    fig.set_size_inches(7, 15)  # enlarge the plot so all feature names fit
    b.set_xlabel("importance", fontsize=15)
    b.set_ylabel("features", fontsize=15)
    b.tick_params(labelsize=10)

plot_importance(clf_gbm, X_train, 20)
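The pairing-and-sorting step inside the helper above depends on the notebook's `clf_gbm` and `X_train`. The same idea can be shown end-to-end on synthetic data; this sketch uses scikit-learn's `RandomForestClassifier` (not LightGBM) purely for illustration, since any estimator exposing `feature_importances_` works the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: 6 features, only the first 3 carry signal
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(6)])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Same pairing/sorting step as the plot_importance helper above
imp = (pd.DataFrame({'Name': X.columns, 'Score': model.feature_importances_})
         .sort_values('Score', ascending=False))
print(imp)
```

Tree-ensemble importances sum to 1 across features, so each score can be read as a share of the model's total impurity reduction.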

XG Boost

In [52]:
%%time
import xgboost as xgb

# Creating the XGBClassifier shell
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', missing=None, seed=42)

'''
Instead of finding the optimal number of trees with k-fold cross-validation,
we use early stopping: training halts once the evaluation metric stops
improving. We only need to specify how many rounds to wait with no
improvement before stopping. The evaluation metric is the area under the
precision-recall curve (AUCPR).
'''

clf_xgb.fit(X_train,
            y_train,
            verbose=True,
            early_stopping_rounds=10,
            eval_metric='aucpr',
            eval_set=[(X_test, y_test)])

# The log below shows AUCPR still edging upward through round 99, so the
# model trains for the full default 100 rounds; early stopping never fires.

# make predictions for test data
y_predict = clf_xgb.predict(X_test)

# evaluate predictions
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))
[0]	validation_0-aucpr:0.71119
Will train until validation_0-aucpr hasn't improved in 10 rounds.
[1]	validation_0-aucpr:0.73931
[2]	validation_0-aucpr:0.74282
[3]	validation_0-aucpr:0.75117
...	(rounds 4-98 omitted; validation AUCPR rises steadily from 0.779 to 0.907)
[99]	validation_0-aucpr:0.90747
      0     1
0  8094  2658
1  1546  9207
Wall time: 11.4 s
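The early-stopping behavior described in the cell above can be reproduced in a self-contained way. This sketch uses scikit-learn's `GradientBoostingClassifier` (not XGBoost itself) on synthetic data: `n_iter_no_change` holds out a validation fraction and stops adding trees once the validation score fails to improve for that many consecutive rounds, analogous to `early_stopping_rounds=10` in XGBoost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Ask for up to 500 trees, but stop once the held-out validation score
# fails to improve for 10 consecutive rounds.
model = GradientBoostingClassifier(n_estimators=500,
                                   validation_fraction=0.2,
                                   n_iter_no_change=10,
                                   random_state=42)
model.fit(X, y)
print(model.n_estimators_)  # number of trees actually built
```

On easy data the stopping criterion usually triggers well before the cap, which is the point: the budget (`n_estimators`) can be set generously without paying for trees that add nothing.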
In [53]:
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
# precision = (TP) / (TP+FP)
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
# recall = (TP) / (TP+FN)
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
Accuracy Score is 0.80451
Precision Score is 0.77598
Recall Score is 0.85623
In [54]:
# Confusion Matrix on the test data
plot_confusion_matrix(clf_xgb,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=["No Balance > 60 Days", "Had Balance > 60 Days"])
Out[54]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1e628e3a848>
In [ ]:
plt.rcParams['figure.figsize'] = [10, 15]  # set size before plotting so it takes effect
xgb.plot_importance(clf_xgb)
plt.show()
In [ ]:
## graphviz has an issue on my laptop's system and is unavailable for now;
## to avoid spending more time on it, this tree plot is left out of the project.
from xgboost import plot_tree
from graphviz import Digraph

# Draw XGBoost Tree
fig = plt.figure(figsize=(10, 10))
ax = fig.subplots()
xgb.plot_tree(clf_xgb, num_trees=20, ax=ax)
plt.show()

# CalledProcessError: Command '['dot', '-Tpng']' returned non-zero exit status 4294967295

Summary and Recommendations

In [ ]:
 

References & Citations

In [ ]: